import pandas as pd
import numpy as np
from lets_plot import *

# Additional libraries needed for ML
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

LetsPlot.setup_html(isolated_frame=True)
Show the code
# Import the data using pandas and the URL
df = pd.read_csv(
    "https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv",
    encoding_errors="ignore",
    skiprows=1,
)
df.head()
[Output: the first 5 rows × 38 columns of the raw survey data — generic column names (Unnamed: 0, Response, Response.1, …), per-episode seen/ranking columns, character favorability columns (Yoda, etc.), and demographics such as gender, age, income range, education, and census region.]
Executive Summary
In this project, I used the FiveThirtyEight Star Wars survey data to build a machine learning model that predicts whether a respondent’s household income is at least $50,000 per year. After cleaning and transforming the demographic and survey variables, I trained a Random Forest classifier using an 80/20 train–test split. The model’s accuracy on the test set (shown below in the Results section) indicates that survey responses and demographic factors carry meaningful signal for income prediction, though they are not perfectly predictive. I did not complete the optional stretch portion of the assignment.
Methods
I began by renaming the original survey columns to more descriptive names and filtering the dataset to respondents who had seen at least one Star Wars movie. Age ranges were converted into numeric midpoints, and a separate indicator was created for missing age values. Education levels were mapped into an approximate "years of schooling" variable, and income ranges were converted into household income midpoints, again with a missing-income indicator. After these feature engineering steps, I created a binary target variable indicating whether household income was at least $50,000. Finally, I one-hot encoded all remaining categorical variables (including character favorability and other survey responses) to produce a fully numeric dataset suitable for modeling.
Show the code
# Include and execute your code here
new_col_names = [
    "respondant_id", "seen_any", "fan_starwars",
    # Whether seen each movie
    "seen_epi1", "seen_epi2", "seen_epi3", "seen_epi4", "seen_epi5", "seen_epi6",
    # Movie ranking (1 = best, 6 = worst)
    "rank_epi1", "rank_epi2", "rank_epi3", "rank_epi4", "rank_epi5", "rank_epi6",
    # Character favorability rankings
    "fav_han", "fav_luke", "fav_leia", "fav_anakin", "fav_obi", "fav_palpatine",
    "fav_darth", "fav_lando", "fav_boba", "fav_c3po", "fav_r2", "fav_jar",
    "fav_padme", "fav_yoda",
    # Who shot first: Han or Greedo?
    "who_shot_first",
    # Know about the Expanded Universe?
    "familiar_expanded_universe",
    # Fan of the Expanded Universe?
    "fan_expanded_universe",
    # Star Trek fan?
    "fan_startrek",
    # Demographics
    "gender", "age", "income", "educ", "location",
]
df.columns = new_col_names

seen_cols = ["seen_epi1", "seen_epi2", "seen_epi3", "seen_epi4", "seen_epi5", "seen_epi6"]

# Filter the dataset to respondents who have seen at least one movie
df_cleaned = df.dropna(subset=seen_cols, how="all").reset_index(drop=True)

def age_mapping(age_range):
    """Convert an age range to its numeric midpoint."""
    if age_range == '18-29':
        return 23.5
    elif age_range == '30-44':
        return 37
    elif age_range == '45-60':
        return 52.5
    elif age_range == '> 60':
        return 69
    else:
        return np.nan

df_cleaned['age_mid'] = df_cleaned['age'].apply(age_mapping).astype('Float64')
df_cleaned['missing_age'] = df_cleaned['age_mid'].isna().astype(int)
df_cleaned['age_mid'] = df_cleaned['age_mid'].fillna(0)
df_cleaned = df_cleaned.drop(columns='age')

# Map education level to approximate years of schooling (default 0 = missing)
df_cleaned['edu_level'] = 0
df_cleaned.loc[df_cleaned['educ'] == 'Less than high school degree', 'edu_level'] = 10
df_cleaned.loc[df_cleaned['educ'] == 'High school degree', 'edu_level'] = 12
df_cleaned.loc[df_cleaned['educ'] == 'Some college or Associate degree', 'edu_level'] = 14
df_cleaned.loc[df_cleaned['educ'] == 'Bachelor degree', 'edu_level'] = 16
df_cleaned.loc[df_cleaned['educ'] == 'Graduate degree', 'edu_level'] = 18
df_cleaned = df_cleaned.drop(columns='educ')

def income_mapping(income_range):
    """Convert an income range to its household income midpoint."""
    if income_range == '$0 - $24,999':
        return 12500
    elif income_range == '$25,000 - $49,999':
        return 37500
    elif income_range == '$50,000 - $99,999':
        return 75000
    elif income_range == '$100,000 - $149,999':
        return 125000
    elif income_range == '$150,000+':
        return 150000
    else:
        return np.nan

df_cleaned['household_income'] = df_cleaned['income'].apply(income_mapping).astype('Float64')
df_cleaned['missing_income'] = df_cleaned['household_income'].isna().astype(int)
df_cleaned['household_income'] = df_cleaned['household_income'].fillna(0)
Results
After training a Random Forest classifier on the processed dataset, the model achieved an accuracy of 0.647, meaning it correctly predicted whether a respondent earns at least $50,000 about 65% of the time. The feature importance plot shows that demographic variables—particularly education level, age midpoint, and income-related categories—were the strongest predictors, while Star Wars–related responses played a smaller role. The probability distribution plot further illustrates how the model assigns confidence levels, with many predictions clustering near 0 or 1 and a noticeable portion in the uncertain middle range. Overall, the results suggest that demographic information provides meaningful but not decisive predictive power for income classification.
Show the code
# Include and execute your code here
df_cleaned['target'] = (df_cleaned['household_income'] >= 50000) * 1

# Collect the character favorability columns
favorability_cols = []
for col in df_cleaned.columns:
    if col.startswith("fav_"):
        favorability_cols.append(col)

# All remaining object (string) columns are also categorical
categorical_cols = favorability_cols.copy()
object_columns = df_cleaned.select_dtypes(include="object").columns
for col in object_columns:
    if col not in favorability_cols:
        categorical_cols.append(col)

df_encoded = pd.get_dummies(df_cleaned, columns=categorical_cols, dtype=int)
print('One-hot encode for all remaining categorical columns:')
df_encoded.head()
One-hot encode for all remaining categorical columns:
[Output: the first 5 rows × 131 columns of the one-hot encoded data — respondent ID, episode rankings (rank_epi1–rank_epi6), age_mid, missing_age, edu_level, and 0/1 dummy columns for the income ranges and census regions.]
Show the code
target = "target"

# Prepare the feature matrix, dropping the target and all income-derived columns
X = df_encoded.drop(columns=[
    "target", "household_income",
    "income_$0 - $24,999", "income_$25,000 - $49,999",
    "income_$50,000 - $99,999", "income_$100,000 - $149,999",
    "income_$150,000+",
])
y = df_encoded["target"]

# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Run and evaluate the model on the test data
y_pred = model.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.3f}')

# Rank features by importance and keep the top 10
feat_imp_df = pd.DataFrame({
    "feature": list(X.columns),
    "importance": model.feature_importances_,
}).sort_values("importance", ascending=False)
feat_imp_df_top = feat_imp_df.head(10)
Accuracy: 0.647
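To judge whether 0.647 is meaningfully better than always guessing the majority class, the model can be compared against scikit-learn's `DummyClassifier`. The cleaned survey frame is not reproduced here, so this is only a sketch on synthetic stand-in data; the variable names and class balance are illustrative assumptions, not the survey's actual values.

```python
# Sketch (synthetic data, not the survey): majority-class baseline vs. a Random Forest
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 200 "respondents", 5 numeric features, imbalanced classes
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > -0.3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print(f"Baseline accuracy: {accuracy_score(y_test, baseline.predict(X_test)):.3f}")
print(f"Model accuracy:    {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

On the real data, the same two-line baseline check would show how much of the 0.647 comes from class imbalance alone.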
Show the code
feat_imp_df_top["feature"] = pd.Categorical(
    feat_imp_df_top["feature"],
    categories=feat_imp_df_top["feature"],
    ordered=True,
)

ggplot(feat_imp_df_top, aes(x="feature", y="importance")) + \
    geom_bar(stat="identity") + \
    coord_flip() + \
    labs(
        x='Feature',
        y='Importance',
        title="Feature Importance for the Machine Learning Prediction",
        subtitle="Which Factors Drive the Model's Income Prediction",
        caption='Source: SurveyMonkey Audience',
    )
Show the code
y_proba = model.predict_proba(X_test)[:, 1]
proba_df = pd.DataFrame({"probability": y_proba})

ggplot(proba_df, aes("probability")) + \
    geom_histogram(binwidth=0.05) + \
    labs(
        x='Probability',
        y='Count',
        title="Predicted Probability of Making at Least $50k Annually",
        subtitle='Model Confidence in Classifying Household Income',
        caption='Source: SurveyMonkey Audience',
    )
Conclusion
Overall, the Random Forest model provides a reasonable level of accuracy in predicting whether a respondent’s household income is at least $50,000 based on their demographics and Star Wars survey responses. This suggests that patterns in age, education, income categories, and fan attitudes are related to income level, though they clearly do not explain all of the variation. The analysis is limited by self-reported survey data, broad income ranges, and simple modeling choices (one model and default hyperparameters). Future work could compare different algorithms, tune model parameters, and explore additional evaluation metrics (such as precision and recall) to better understand the trade-offs in predicting higher-income respondents.
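The additional metrics suggested above are straightforward to compute with scikit-learn. As a sketch (the labels and predictions below are hypothetical, standing in for `y_test` and `y_pred` from the real model): precision asks, of respondents predicted to earn at least $50k, how many actually do; recall asks, of respondents who actually earn at least $50k, how many the model found.

```python
# Sketch with hypothetical labels/predictions (not the survey model's output)
from sklearn.metrics import precision_score, recall_score, classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # 1 = household income >= $50k
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # model predictions

print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
print(classification_report(y_true, y_pred, target_names=["<$50k", ">=$50k"]))
```

Swapping in the real `y_test` and `y_pred` would show whether the model's errors fall mostly on false positives or false negatives, which accuracy alone hides.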